Adult speech perception reflects the long-term regularities of the native language, but 
it is also flexible such that it accommodates and adapts to adverse listening conditions 
and short-term deviations from native-language norms. The purpose of this article is 
to examine how the broader neuroscience literature can inform and advance research 
efforts in understanding the neural basis of flexibility and adaptive plasticity in speech 
perception. Specifically, we highlight the potential role of learning algorithms that rely on 
prediction error signals and discuss specific neural structures that are likely to contribute 
to such learning. To this end, we review behavioral studies, computational accounts, 
and neuroimaging findings related to adaptive plasticity in speech perception. Already, a 
few studies have alluded to a potential role of these mechanisms in adaptive plasticity 
in speech perception. Furthermore, we consider research topics in neuroscience that 
offer insight into how perception can be adaptively tuned to short-term deviations while 
balancing the need to maintain stability in the perception of learned long-term regularities. 
Consideration of the application and limitations of these algorithms in characterizing 
flexible speech perception under adverse conditions promises to inform theoretical 
models of speech. 

Keywords: perceptual learning, plasticity, supervised learning, cerebellum, language, prediction error signals, 
speech perception 



Spoken language is conveyed by transient acoustic signals with 
complex and variable structure. Ultimately, the challenge of 
speech perception is to map these signals to representations (e.g., 
pre-lexical and lexical knowledge) of an individual's native lan- 
guage community. In real-world environments, this challenge is 
frequently exacerbated under adverse listening conditions aris- 
ing from noisy listening environments, hearing impairment, or 
speech that deviates from long-term speech regularities due to 
talkers' accents, dialects or speech disorders. In circumstances 
where adverse conditions lead to systematic short-term devi- 
ations from the long-term regularities of a language, a lis- 
tener can rapidly adjust the mappings from acoustic input 
to long-term knowledge. However, little is known about the 
mechanisms underlying adaptive plasticity in speech perception. 
Understanding such rapid adaptive plasticity may provide insight 
into how the perceptual system deals with adverse listening sit- 
uations. Although there has been recent interest in investigating 
adaptive plasticity in speech perception, these studies have used 
different tasks and methodologies and remain mostly uncon- 
nected. It is one of the goals of this paper to review these findings 
and integrate the results within a potentially common framework. 

To this end, we examine a number of factors that influence 
adaptive plasticity in speech perception and review behavioral, 
computational, and functional neuroimaging studies that have 
contributed to our current understanding of adaptive processes. 
In reviewing these mostly separate strands of research, we take the 
view that examining candidate neural systems that may under- 
lie the behavioral changes could reveal a unifying framework 
for understanding how adaptive plasticity is achieved. We draw 
from domains outside of speech perception to consider super- 
vised learning relying on sensory prediction error signals as a 
potential mechanism for uniting seemingly distinct behavioral 
speech perception phenomena. From this perspective, we pro- 
pose that understanding the neural basis of adaptive plasticity in 
speech perception will require integrating subcortical structures 
into current frameworks of speech processing, which until now 
have largely focused on the cerebral cortex. Specifically, we exam- 
ine the possibility that subcortical-cortical interactions may form 
functional networks for driving plasticity. 

INSIGHTS FROM BEHAVIORAL STUDIES 

We first examine two distinct behavioral literatures, each demon- 
strating adaptive changes in speech perception in response to 
signal distortions. One set of studies investigates improvements 
in spoken word recognition following experience with distorted 
signals. The other examines changes in acoustic phonetic percep- 
tion following experience with distorted input in disambiguating 
contexts. Both sets of studies show changes at early stages of 
speech processing, which are facilitated by disambiguating contextual 
sources of information (e.g., lexical information). Across 
different studies and tasks, perceptual effects showing adap- 
tive changes in speech perception have been variously termed 
"perceptual learning," "adaptation," "recalibration," and "retun- 
ing," with the choice of descriptor driven mostly by the associ- 
ated task. Here, we use "adaptive plasticity" as a broader term 
to be inclusive of distinct literatures and different tasks that 
may tap into some of the same processes in adjusting speech 
perception to accommodate short-term deviations in speech 
acoustics. 

ADAPTIVE PLASTICITY IN WORD RECOGNITION TASKS 

Adults rapidly and effortlessly extract words from fluent speech. 
However, adverse listening conditions can affect the quality and 
reliability of the acoustic speech signal and negatively impact 
word recognition, reducing intelligibility (for review see Mattys 
et al., 2012). Under certain circumstances, brief experience with 
the adverse listening condition results in intelligibility improve- 
ments (e.g., Pallier et al, 1998; Liss et al., 2002; Clarke and Garrett, 
2004; Bradlow and Bent, 2008). For example, several studies have 
shown that brief familiarization with natural foreign-accented 
speech can improve intelligibility of the accented talker (e.g., 
Clarke and Garrett, 2004; Bradlow and Bent, 2008) and, under 
some circumstances, generalize to intelligibility improvements for 
speech from other talkers with the same native language back- 
ground (e.g., Bradlow and Bent, 2008). Such adaptive plasticity is 
observed across many acoustic speech signal distortions including 
synthesized text-to-speech (Schwab et al., 1985; Greenspan et al., 
1988; Francis et al., 2000), dysarthric speech (Liss et al., 2002), 
and speech in noise (e.g., Cainer et al., 2008). It is also observed 
with more synthetic manipulations of the speech signal such as 
noise vocoding (Davis et al., 2005), spectral shifting (e.g., Fu and 
Galvin, 2003), and time compression (e.g., Altmann and Young, 
1993). Many of these experimental manipulations relate to com- 
monly occurring natural adverse listening experiences and some 
are intended to mimic the degraded experiences encountered 
by listeners with hearing deficits or cochlear implants. Overall, 
there is widespread evidence that intelligibility of distorted speech 
input improves with relatively brief experience or training across 
many different types of signal distortion. 

Though the flexibility of perception under a variety of adverse 
listening conditions indicates the robustness of adaptive plasticity 
in speech perception, the use of different stimulus manipula- 
tions and different types of training and experience across studies 
makes it difficult to build an integrative model. However, several 
key characteristics merit special attention. One significant char- 
acteristic of studies in this literature is the supportive influence of 
information that disambiguates the acoustics of distorted words. 
This information may originate from external feedback indicating 
the appropriate interpretation of the signal. For example, intelli- 
gibility is improved when a distorted acoustic word is paired with 
the written form of the word during the initial presentation (e.g., 
Fu and Galvin, 2003) or following the response (e.g., Greenspan 
et al., 1988; Francis et al., 2000, 2007), and when the clear undis- 
torted version of the signal precedes the distorted signal during 
training (Hervais-Adelman et al., 2008). Each of these approaches 
provides the speech system with information to support mapping 
the distorted speech signal to linguistic knowledge. 

Adaptive plasticity in speech perception can even occur with- 
out explicit feedback. Mere exposure to nonnative-accented 
speech results in improvements in performance in the absence 
of explicit feedback or other explicit information about the cor- 
rect interpretation (e.g., Altmann and Young, 1993; Mehler et al., 
1993; Sebastian-Galles et al, 2000; Liss et al, 2002). Simply lis- 
tening to time-compressed speech (Altmann and Young, 1993; 
Mehler et al., 1993; Sebastian-Galles et al, 2000) or natural 
speech from dysarthric patients (Liss et al, 2002) can lead to 
intelligibility improvements. Likewise, experience with distorted 
sentences containing real target words improves recognition of 
subsequent distorted sentences to a greater degree than experi- 
ence with target nonwords (Davis et al., 2005). These findings 
suggest that internally generated lexical information may also 
contribute to adaptive plasticity. In sum, information that sup- 
ports the disambiguation of speech, including externally provided 
information and internally generated lexical information, may 
promote adaptive plasticity (Davis et al., 2005; Hervais-Adelman 
et al, 2008). 

A second significant characteristic of adaptive plasticity is that 
when external sources of information are unavailable to resolve 
the ambiguity of the distorted acoustic signal, the degree of 
adaptation appears to be dependent on the severity of the dis- 
tortion (Bradlow and Bent, 2008; Li et al, 2009). For example, 
listeners show greater adaptation to relatively more intelligible 
foreign accented speech (Bradlow and Bent, 2008). Other studies 
have shown that adaptive plasticity is difficult for severe artificial 
speech distortions (e.g., Li and Fu, 2006, 2010), whereas grad- 
ually increasing the severity of the distortion (Guediche et al., 
2009) or intermixing less severely distorted signals with more 
severe distortions (Li et al., 2009) facilitates adaptation. Indeed, 
adaptive plasticity can be readily observed for time-compressed 
speech, even without external feedback (e.g., Pallier et al., 1998). 
This may be because the degree of time compression generally 
used tends to result in more intelligible distortions (often 50- 
60% intelligibility or greater) (e.g., Pallier et al., 1998; Peelle and 
Wingfield, 2005; Adank and Janse, 2009) in comparison to text- 
to-speech or noise-vocoded speech distortions (e.g., Schwab et al, 
1985; Francis et al, 2000; Davis et al, 2005; Hervais-Adelman 
et al., 2011) that typically employ feedback to promote adaptive 
plasticity. 

A third key characteristic of adaptive plasticity is that improve- 
ments in intelligibility as a consequence of experience generalize 
to words not encountered during a training or exposure period 
(Schwab et al, 1985; Francis et al., 2000, 2007). In fact, in many 
studies all the words in the experiment are unique. Therefore, 
even though lexical knowledge can mediate adaptive plasticity by 
disambiguating the distorted signals (Davis et al., 2005), adap- 
tive change must occur in the mapping of the distorted sounds to 
pre-lexical representations and not in the mapping from speech 
acoustics to any particular lexical item. 

Overall, studies of adaptive plasticity in word recognition 
employ multiple stimulus distortions and various approaches to 
delivering experience with the speech distortion. Experience with 
the speech distortions can lead to improvements in intelligibility, 
through adaptive processes that retune the mapping of the distorted 
acoustic speech input to the speech processing system. 
The remapping seemingly plays out at an early stage of percep- 
tion (e.g., pre-lexical). This adaptive plasticity is facilitated by 
the availability of disambiguating external information (such as 
explicit feedback or corresponding clear and undistorted speech), 
and also by signals that are relatively less distorted and, therefore, 
more intelligible. Disambiguating information and baseline intel- 
ligibility may have their influence on adaptive plasticity through 
a common means: each may impact the relative accuracy with 
which the distorted acoustics are mapped to established long- 
term regularities of the native language. 

If both externally-provided and internally-generated infor- 
mation contribute to adaptive plasticity, the impact of external 
feedback on adaptive plasticity is likely to be greater for less intelli- 
gible signal distortions compared to more intelligible distortions. 
Indeed, when distortion intelligibility and the presence of external 
feedback are independently manipulated, the two factors interact 
to modulate the degree of adaptive plasticity observed (Guediche 
et al, 2009). Intelligibility serves as a metric for the accuracy with 
which listeners can map distorted signals to lexical knowledge. 
Greater intelligibility thus indicates greater success in mapping 
distorted acoustics, which may produce internal signals to guide 
adaptive plasticity that are less reliable or less available when 
intelligibility is low. In this latter case, external information that 
supports accurate mapping may serve to drive adaptive plasticity. 
We return to the implications of this possibility below. 

ADAPTIVE PLASTICITY IN ACOUSTIC PHONETIC PERCEPTION 

Adaptive plasticity has also been shown in other speech tasks that 
examine acoustic phonetic perception. Acoustic phonetic per- 
ception involves a complex mapping of acoustic speech signals 
that vary along multiple, largely continuous acoustic dimen- 
sions to long-term representations that respect the regularities 
of the native language (e.g., phonemes, words). This mapping 
is complicated by the fact that even when measured in quiet, 
well-controlled laboratory conditions, the acoustics conveying a 
particular phoneme or word are highly variable (e.g., Peterson 
and Barney, 1952). 

Under adverse conditions more typical of natural listening 
environments, there are short-term deviations in speech acoustics 
introduced by sources like foreign accent, dialect, noise, differ- 
ent speakers, and speech disorder. These systematic deviations 
can distort the acoustic speech signal. A listener may encounter 
a native Spanish talker referring in English to a fish using a vowel 
with acoustics more typical of English /i/ (a feesh) than /ɪ/. The 
same listener might also encounter a native Pittsburgh talker chatting 
about the local football team, the Steelers, in the local dialect 
that produces English /i/ with acoustics more typical of /ɪ/ (the 
Stillers). Listeners would have little difficulty in either case as the 
perceptual system flexibly adjusts to such signal distortions. 

A broad research literature with a long history demonstrates 
that ambiguous speech signals can be resolved using many 
sources of contextual information. Acoustic (Lotto and Kluender, 
1998; Holt, 2005), lexical (Ganong, 1980), visual (McGurk and 
MacDonald, 1976; MacDonald and McGurk, 1978), and sentence 
contexts (Ladefoged and Broadbent, 1957), among others, 
each play a role in disambiguating speech signals. A sound with 
ambiguous acoustics between /g/ and /k/ is more likely to be 
perceived as /k/ in the context of iss (kiss is a real word, 
giss is not), but as /g/ in the context of ift (Ganong, 1980). 
Similarly, an ambiguous sound between /b/ and /d/ can be disambiguated 
by watching a video of a face articulating /b/ vs. 
/d/ (Bertelson et al., 1997). Relevant to the adaptive plasticity 
literature, repeated exposure to an ambiguous acoustic speech 
signal in a disambiguating context affects later perception of 
the ambiguous speech — even in the absence of a biasing con- 
text (Norris et al., 2003; Vroomen et al., 2007). This suggests an 
adaptive change in the way the ambiguous speech acoustics are 
mapped that remains even when the biasing context is no longer 
available. 

Two such biasing contexts have been explored extensively: lexical 
context and visually-presented articulating faces (Bertelson 
et al., 2003; Norris et al., 2003; Vroomen et al., 2007). Lexically-mediated 
changes in acoustic phonetic perception can be 
achieved by exposing listeners to ambiguous speech sounds 
embedded in lexical contexts that only produce a valid lexical 
item for one of the phonemes (e.g., Norris et al., 2003; Kraljic and 
Samuel, 2005; Maye et al., 2008; for review see Samuel and Kraljic, 
2009). For example, when an acoustically-ambiguous 
sound between /s/ and /ʃ/ is presented in contexts for which only 
/s/ completes a real word (e.g., legacy, Arkansas), lexical knowledge 
provides a means of disambiguating the sound (Ganong, 
1980). This experience affects subsequent [s]-[ʃ] perception such 
that the acoustically-ambiguous [s]-[ʃ] sound is more broadly 
accepted as [s] following exposure to [s]-consistent lexical contexts 
than following exposure to [ʃ]-consistent contexts (e.g., 
pediatrician; Kraljic and Samuel, 2005). This effect is observed 
even when the lexically-biasing context is no longer present. Many 
experiments have demonstrated such lexical tuning of acoustic 
phonetic perception across phonemes, languages, and talkers in 
adults (see for review Samuel and Kraljic, 2009) and even among 
6- and 12-year-old children (McQueen et al., 2012). 

Exposure to visual information from an articulating face that 
disambiguates an ambiguous speech sound produces similar 
changes in acoustic phonetic perception. Bertelson et al. (2003) 
examined phonetic perception of a token that was acoustically 
ambiguous between /aba/ and /ada/. Following exposure to the ambiguous token 
paired with a video of a face clearly articulating /aba/, subsequent 
perception of ambiguous /aba/-/ada/ stimuli shifted toward /aba/; that is, 
the ambiguous acoustics were more often treated as consistent with /aba/. 

Although lexical and visually-mediated adaptive plasticity have 
been most studied to date, other factors can also drive adaptive 
plasticity. Phonotactic probabilities (Cutler, 2008) and statistical 
regularities experienced across multiple tokens of speech exem- 
plars (Clayards et al., 2008; Idemaru and Holt, 2011) can also 
result in adaptive plasticity. In the latter example, correlations 
among acoustic cues provide a disambiguating source of infor- 
mation for how acoustic dimensions relate to one another in 
signaling phonemes (Idemaru and Holt, 2011). These findings 
are consistent with a rich literature demonstrating that listeners 
make use of many sources of information to disambiguate inher- 
ently ambiguous acoustic speech input. The literature on adaptive 
plasticity extends these observations by demonstrating that upon 
repeated exposure, the effects of a disambiguating context can 
remain even in the absence of context. 

Clarke-Davidson et al. (2008) argue that data demonstrating 
adaptive plasticity in acoustic phonetic perception are best fit by 
modeling adaptation at the level of perceptual (pre-lexical) pro- 
cessing rather than at a subsequent decision level. In general, the 
nature of this pre-lexical influence is to more broadly accept the 
ambiguous acoustics as consistent with the biasing context. In 
other words, the adaptive adjustments of acoustic-phonetic per- 
ception are in the direction of the disambiguating (lexical, visual, 
statistical) contexts. In this way, adaptive plasticity in acoustic 
phonetic perception bears resemblance to adaptive plasticity in 
word recognition reviewed above. Specifically, both examples of 
adaptive plasticity show that contextual information (e.g., lexical 
information) can drive changes in perception at a pre-lexical level. 

SUMMARY 

Two largely independent strands of research demonstrate rapid 
adaptive changes in the mapping of distorted acoustic speech sig- 
nals. They have evolved in parallel, kept distinct primarily along 
paradigmatic lines, with little cross-talk (although Norris et al. 
(2003), Cutler (2008), and Samuel (2011) note commonalities). 
Motivated by results across these studies that show similarities, 
such as the contributions of both internal (e.g., lexical) and exter- 
nal (e.g., feedback) information sources, a common pre-lexical 
locus, and a similar influence of the severity of the acoustic dis- 
tortion on the degree of adaptation, we explore the possibility 
that these commonalities reflect common mechanisms. We first 
review computational modeling efforts that account for adaptive 
plasticity, and then turn to cognitive neuroscience and neuro- 
science research in other domains for further insights. 

INSIGHTS FROM COMPUTATIONAL MODELING 

Computational models assist in understanding adaptive plasticity 
by explicitly modeling outcomes of potential learning algorithms 
and relating these outcomes directly to behavioral evidence. 
Traditional computational models of speech perception are gen- 
erally defined by hierarchically-organized layers that represent 
linguistic information at different levels of abstraction (e.g., perceptual/featural, 
pre-lexical, lexical). Two classes of hierarchical 
models — feedforward models (e.g., Norris, 1994; Norris et al., 
2000) and interactive models (e.g., McClelland and Elman, 1986; 
Gaskell and Marslen-Wilson, 1997) — have been especially influ- 
ential and each has provided an account of rapid adaptive plastic- 
ity, specifically focusing on lexically-mediated adaptive plasticity 
as measured by changes in acoustic phonetic perception (e.g., 
Norris et al., 2003). In the interactive model Hebb-TRACE, an 
unsupervised learning algorithm, Hebbian learning, is used to 
modify connection weights (Mirman et al, 2006), whereas a 
supervised learning algorithm (backpropagation) is proposed in 
the context of the feedforward MERGE model (Norris et al., 
2003). 

One influential debate between feedforward and interactive 
accounts is the degree to which different levels interact with one 
another. In feedforward models like MERGE, there is no direct 
feedback from lexical representations to influence online speech 
perception. Thus, in contrast to interactive models, adaptive 
plasticity arises from feedback that is dedicated only for the 
purpose of learning. Norris et al. (2003) propose that in this 
case, feedback from lexical to pre-lexical levels is used to derive 
an error signal that indicates the degree to which there is a 
discrepancy between the expected phonological representation 
activated by the lexical item and the one indicated by the acoustic 
speech signal. They propose backpropagation, first instantiated 
by Rumelhart et al. (1986), as an implementation of super- 
vised learning to produce adaptive plasticity. Backpropagation 
uses error signals to drive changes in the weights of connections 
between the input speech signal and the pre-lexical informa- 
tion to reduce the discrepancy. Because the pre-lexical units 
mediate mapping between acoustic input and lexical knowledge, 
generalization to new words also results. While backpropaga- 
tion provides a supervised learning mechanism that may capture 
the rapid nature of the observed behavioral effects, it is not 
neurobiologically plausible (Crick, 1989). 
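
To make the error-driven account concrete, the following minimal Python sketch applies a single-layer delta-rule update, the simplest case of the error-driven learning that backpropagation generalizes. It is not the MERGE implementation: the two-element acoustic vector, the starting weights, the learning rate, and all function names are hypothetical and chosen only for illustration. The lexically expected activation serves as the teaching signal, and the discrepancy between it and the activation evoked by the ambiguous input adjusts the weights from acoustic input to a pre-lexical unit.

```python
import math


def sigmoid(x):
    """Logistic squashing function for unit activation."""
    return 1.0 / (1.0 + math.exp(-x))


def prelexical_activation(weights, acoustic):
    """Activation of one pre-lexical unit given an acoustic feature vector."""
    return sigmoid(sum(w * a for w, a in zip(weights, acoustic)))


def delta_rule_step(weights, acoustic, lexical_target, rate=0.5):
    """One supervised update; the lexically expected activation is the teacher."""
    actual = prelexical_activation(weights, acoustic)
    error = lexical_target - actual  # prediction error signal
    new_weights = [w + rate * error * a for w, a in zip(weights, acoustic)]
    return new_weights, error


# Hypothetical two-feature input that is ambiguous for the /s/ unit:
# the two features cancel, so the unit starts at chance activation.
weights_s = [1.0, -1.0]
ambiguous = [0.5, 0.5]

before = prelexical_activation(weights_s, ambiguous)
errors = []
for _ in range(20):  # roughly the number of exposure trials reported behaviorally
    weights_s, err = delta_rule_step(weights_s, ambiguous, lexical_target=1.0)
    errors.append(abs(err))
after = prelexical_activation(weights_s, ambiguous)

print(f"activation before: {before:.2f}, after: {after:.2f}")
```

Because the update changes the mapping from acoustic input to the pre-lexical unit rather than anything tied to a specific word, the same ambiguous acoustics would activate that unit more strongly in any new word, which corresponds to the generalization property described above.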

Hebb-TRACE (Mirman et al, 2006) is a modification of the 
interactive TRACE model (McClelland and Elman, 1986) that 
has an added Hebbian learning algorithm. It models adaptive 
plasticity via adjustments in the weights mapping from input 
to pre-lexical representations. Lexical activation results in direct 
excitatory feedback from the lexical layer to pre-lexical infor- 
mation consistent with the word. Processing of a perceptually 
ambiguous sound (e.g., with acoustics between /s/ and /ʃ/) leads 
to partial activation of both consonants with lateral inhibitory 
within-level connections leading to competition between the two 
alternatives at the pre-lexical level. The biasing lexical context 
(e.g., legacy, Arkansas) increases the activation of the congruent 
phoneme (/s/) through direct excitatory feedback, granting 
it an advantage over the partially-activated /ʃ/. To achieve adaptive 
plasticity, the mapping of lower-level perceptual information 
to phonetic categories is adjusted via Hebbian learning such 
that subsequent perception of these consonants is more likely 
to activate the consonant consistent with the previous lexical 
context, even in the absence of the biasing context. By this 
account, the same lexical feedback that influences online acous- 
tic phonetic perception also guides learning of the mapping of 
distorted speech onto pre-lexical representations. A difficulty for 
this account is its time course. Whereas adaptive plasticity effects 
can require as few as 10-20 trials to evoke, Hebbian learning 
has a much slower time course for learning (Norris et al., 2003; 
Vroomen et al., 2007). 
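
The Hebbian account can be illustrated with a similarly small Python sketch. It is a toy under stated assumptions, not the Hebb-TRACE model: the weights, the size of the lexical feedback, the inhibition strength, and the learning rate are hypothetical values, and the two units are labeled "s" and "sh" for /s/ and /ʃ/. Lexical feedback boosts the congruent unit during exposure, lateral inhibition suppresses its competitor, and each input weight is then strengthened in proportion to the co-activation of the input and the unit.

```python
def bottom_up(weights, acoustic):
    """Input-driven activation of a pre-lexical unit (no lexical context)."""
    return sum(w * a for w, a in zip(weights, acoustic))


def hebbian_step(weights, acoustic, unit_activation, rate=0.1):
    """Strengthen each input-to-unit weight in proportion to co-activation."""
    return [w + rate * a * unit_activation for w, a in zip(weights, acoustic)]


ambiguous = [0.5, 0.5]      # hypothetical input midway between /s/ and /sh/
w_s = [1.0, 0.0]            # hypothetical weights into the /s/ unit
w_sh = [0.0, 1.0]           # hypothetical weights into the /sh/ unit
lexical_feedback_s = 0.4    # excitatory feedback from an /s/-biasing word
inhibition = 0.5            # strength of lateral inhibition between units

for _ in range(10):
    act_s = bottom_up(w_s, ambiguous) + lexical_feedback_s
    act_sh = bottom_up(w_sh, ambiguous)
    # Lateral inhibition: each unit is suppressed by its competitor.
    act_s_final = max(0.0, act_s - inhibition * act_sh)
    act_sh_final = max(0.0, act_sh - inhibition * act_s)
    w_s = hebbian_step(w_s, ambiguous, act_s_final)
    w_sh = hebbian_step(w_sh, ambiguous, act_sh_final)

# Test phase: the same ambiguous input, now without any lexical context.
s_without_context = bottom_up(w_s, ambiguous)
sh_without_context = bottom_up(w_sh, ambiguous)
print(s_without_context > sh_without_context)
```

In this sketch the learning signal is the unit's own activation rather than an explicit error term, so the weights keep growing for as long as the input and the unit are co-active. The contrast with an error-driven update, which shrinks as the mismatch is resolved, is the design difference between the two accounts.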

Although the focus of traditional computational accounts has 
been on modeling the effect of lexical information on acoustic 
phonetic perception, the proposed learning mechanisms may be 
capable of accounting for adaptation to distorted speech input of 
the sort observed in the word recognition literature. Norris et al. 
(2003) explicitly make the connection between the mechanisms 
involved in adaptive plasticity of acoustic phonetic perception and 
those that underlie improvements in word recognition. The proposed 
mechanisms for lexically-guided adaptive plasticity in both 
MERGE and Hebb-TRACE also could be extended to accounts 
of other types of lexically-mediated adaptive plasticity and effects 
of other linguistic information at other higher levels of linguistic 
abstraction [e.g., sentence context (Borsky et al., 1998)], or different 
modalities (e.g., visual information; Vroomen et al., 2007). 
Nonetheless, to this point, these disparate strands of research have 
not been integrated and there have been few attempts to exam- 
ine whether it may be possible to unite different phenomena of 
adaptive plasticity in speech perception on mechanistic grounds. 

A UNIFYING PERSPECTIVE? 

The behavioral and modeling literatures that investigate adap- 
tive plasticity in speech processing have distinct approaches that 
make it challenging to draw direct comparisons. However, evalu- 
ating them together reveals that there are a few observations any 
account of adaptive plasticity must address. One is that infor- 
mation that disambiguates distorted or otherwise perceptually- 
ambiguous acoustic speech input rapidly adjusts the way that the 
system maps speech input at a pre-lexical level, such that later 
input is less ambiguous even when disambiguating information 
is no longer present to support interpretation. Long-term knowl- 
edge, external feedback, and overall intelligibility of the distorted 
input each seem to play a role in modulating the extent to which 
adaptive plasticity is observed. 

A common feature among different forms of disambiguating 
information may be that they each provide a basis for gen- 
erating predictions. This characteristic relates to recent work 
suggesting that predictive coding may be a useful framework for 
understanding speech processing. To this end, we use predic- 
tive coding as an illustrative approach for considering adaptive 
plasticity. Predictive coding models capitalize on the reciprocal 
connections between different levels of a hierarchically organized 
structure and provide a way for generating predictions from 
externally-provided context or from internally-accessed informa- 
tion induced by the stimulus itself (Bastos et al., 2012; Panichello 
et al., 2012). The idea is that feedback from higher levels in the 
hierarchical speech processing structure can modulate activity 
in lower levels. These predictions are compared with the actual 
sensory input such that any discrepancies result in an internally- 
generated prediction error signal. This error signal, in turn, drives 
adaptive adjustments of the internal prediction to improve align- 
ment of future predictions with incoming input. Although there 
is still debate regarding the role of different sources of feedback in 
online perception compared to adaptive plasticity (Norris et al., 
2000; McClelland et al., 2006), the generation of predictions and 
prediction error signals may be common to both processes. 
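
The core loop of this idea (predict, compare with the input, and adjust the prediction by a fraction of the error) can be sketched in a few lines of Python. This is a deliberately reduced illustration, not a model from the predictive coding literature cited above: it uses a single scalar acoustic cue, and the cue values, the update rate, and the function name are hypothetical.

```python
def predictive_coding_trial(prediction, sensory_input, rate=0.3):
    """Compare the prediction with the input and nudge the prediction by the error."""
    error = sensory_input - prediction  # prediction error signal
    return prediction + rate * error, error


prediction = 0.0      # expected cue value under long-term regularities (hypothetical)
shifted_input = 1.0   # systematically shifted cue, e.g., from an accented talker

errors = []
for _ in range(15):
    prediction, err = predictive_coding_trial(prediction, shifted_input)
    errors.append(abs(err))

print(f"first error: {errors[0]:.2f}, last error: {errors[-1]:.4f}")
```

With a consistent deviation in the input, the error shrinks on every trial and the prediction converges on the shifted cue value. A small update rate would keep the prediction close to long-term regularities, whereas a large one would track short-term deviations quickly, which is one simple way to picture the balance between stability and plasticity.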

In the domain of adaptive plasticity for acoustic phonetic 
perception, Vroomen and colleagues suggested that "crossmodal 
conflict" is responsible for driving rapid changes in perception 
and noted the possibility that it provides a common mecha- 
nism for both lexically-mediated and visually-mediated adaptive 
plasticity (Vroomen et al., 2007; Vroomen and Baart, 2012). 
They argued that in both cases, a discrepancy (i.e., error signal) 
between the information provided by different sources of infor- 
mation (lexical, visual) and the information provided by the input 
sensory modality (ambiguous acoustic speech signal) leads to 
adaptive plasticity. Bertelson et al. (2003), Vroomen et al. (2007), and 
Vroomen and Baart (2012) also noted the intriguing similarities 
between adaptive plasticity in speech perception and sensorimotor 
adaptation (such as is observed for adapting movements while 
wearing visually-distorting prism goggles; Martin et al., 1996b). 
Namely, each depends on discrepancies between expected and 
actual sensory outcomes. Although Vroomen et al.'s analogy has 
been rarely linked to the supervised learning algorithms that are 
posited as a mechanism of adaptive plasticity in the MERGE 
model (Norris et al., 2003), it is strikingly similar. Discrepancies 
between expectations of the input that result from lexical 
activation and the actual activation from the input form the 
basis of the prediction error signals that drive supervised learning for 
adaptive plasticity, and they relate closely to mechanisms attributed to 
sensorimotor adaptation in literatures outside of speech perception 
(see Wolpert et al., 2011 for review). Thus, consideration 
of the mechanisms underlying prediction error signals, gener- 
ally, and sensorimotor adaptation, more specifically, may reveal a 
rapid and biologically-plausible neural mechanism for achieving 
adaptive plasticity in speech perception. 

INSIGHTS FROM COGNITIVE NEUROSCIENCE 

NEUROIMAGING EVIDENCE FOR PREDICTIVE CODING IN SPEECH PERCEPTION 

Although neuroanatomical models of speech perception differ 
in their details, the general consensus is that there are two or 
more hierarchically-organized streams that diverge from pos- 
terior superior temporal cortex (Hickok and Poeppel, 2007; 
Rauschecker, 2011). The popular dual-stream model by Hickok 
and Poeppel (2007) suggests a ventral stream that supports access 
to meaning and combinatorial processes, and a dorsal stream that 
supports access to articulatory processing. In the ventral stream, 
more posterior areas of temporal cortex are involved in perceptual 
and lower levels of speech processing, whereas more anterior tem- 
poral cortical regions are involved in more abstract higher levels 
of language processing (Hickok and Poeppel, 2007; Rauschecker 
and Scott, 2009; DeWitt and Rauschecker, 2012). In particular, 
superior temporal areas are recruited for sensory-based percep- 
tual processes, posterior middle and inferior temporal areas are 
engaged in lexical and semantic processes, and anterior supe- 
rior and middle temporal areas are involved in comprehension 
(Binder et al., 2004, 2009; Scott, 2012). Supporting evidence for a 
posterior (responding earlier) to anterior (responding later) ven- 
tral processing stream in temporal cortex comes from a variety 
of neuroimaging methodologies and analyses (e.g., Gow et al., 
2008; Leff et al., 2008; Sohoglu et al., 2012). In the dorsal stream, 
parietal areas have been implicated in sensorimotor processing 
and frontal areas in articulatory processing. However, there is also 
evidence for parietal involvement in other aspects of speech processing,
including semantic and conceptual processes (e.g., Binder
et al., 2009; Seghier et al., 2010) and lexical and sound categorization
(e.g., Blumstein et al., 2005; Rauschecker, 2012). Similarly, 
other functions have been attributed to frontal areas, such as 
suggestions that the inferior frontal gyrus (BA44/45) is engaged 
in syntactic and executive processes (Caplan, 1999, 2006; Binder 
et al., 2004; Fedorenko et al., 2012). Nonetheless, the view that 
multiple hierarchically organized neural streams support different
aspects of perception has been established as a framework 
for understanding visual and auditory perception 
(Ungerleider and Haxby, 1994; Rauschecker and Tian, 2000), and 
is also becoming a widely accepted view for speech processing 
(e.g., Hickok and Poeppel, 2004, 2007; Rauschecker and Scott, 
2009; Peelle et al., 2010; Price, 2012). 
This kind of hierarchically organized system has formed the 
basis for understanding speech processing. For example, mod- 
els that propose predictive coding also postulate a system that 
is hierarchically organized with reciprocal connections between 
different stages of processing. Although the focus of such mod- 
els has been on online speech processing rather than adaptive 
plasticity, understanding how predictions affect changes in brain 
activity is essential for each of these processes. At the neural level, 
the predictive coding framework suggests predictions can serve to 
constrain perception through feedback signals from regions asso- 
ciated with processing information at higher levels of abstraction 
(e.g., frontal areas that are at higher levels in the speech hierar- 
chy) that modulate activity in regions associated with perceptual 
processes (e.g., temporal areas that receive the top-down mod- 
ulation) (for review see Davis and Johnsrude, 2007; Peelle et al., 
2010; Wild et al., 2012b). Thus, the literature on predictive coding 
has focused largely on changes in frontal areas (associated with 
higher-level processes) and temporal areas (associated with per- 
ceptual processes). Based on hypothesized functions of different 
brain regions, neuroimaging studies have provided some evidence 
for predictive mechanisms in speech perception (e.g., Clos et al., 
2012; Sohoglu et al., 2012; Wild et al., 2012) by examining effects 
of predictive contexts and stimulus distortions, as well as their 
interactions. 

Consistent with a hierarchically organized predictive coding 
framework, manipulation of predictive contexts modulates activ- 
ity in frontal areas, with greater activity typically observed for 
more predictive contexts (e.g., Myers and Blumstein, 2008; Gow 
et al., 2008; Davis et al., 2011; Clos et al., 2012; Wild et al., 
2012). Not surprisingly, stimulus distortions modulate activity 
in temporal areas (Davis et al., 2011; Clos et al., 2012; Wild 
et al., 2012), which are associated with early perceptual processes. 
Findings from MEG provide supporting evidence that this modu- 
lation begins early in the speech processing time course (Sohoglu 
et al., 2012). Interestingly, effects related to manipulations of 
speech signal distortion seem to depend on stimulus intelligi- 
bility, with greater activity to distortion severity for intelligible 
stimuli and decreased response to distortion severity for unintelligible
stimuli (Poldrack et al., 2001; Adank and Devlin, 2010). 
This U-shaped response function indicates that modulatory influ- 
ences of signal distortion in temporal cortex may be dependent 
on multiple factors. Although not all of the studies examine or 
report modulatory influences of stimulus distortions on frontal 
areas, many studies do show increases in frontal activity associ- 
ated with increases in the distortion severity (e.g., Poldrack et al., 
2001; Adank and Devlin, 2010; Eisner et al., 2010). 

Since the size of the prediction error signal depends on both 
the predictive context and the congruency of the acoustic input, 
one approach has been to examine the interaction between a 
predictive context and a stimulus distortion in order to deter- 
mine potential regions that encode error signals (Spratling, 2008; 
Gagnepain et al., 2012; Clark, 2013). A number of studies have 
shown such interactions in both temporal and frontal areas (e.g., 
Obleser and Kotz, 2010; Davis et al., 2011; Obleser and Kotz, 
2011; McGettigan et al., 2012; Sohoglu et al., 2012; Guediche 
et al., 2013). Davis et al. (2011) found an interaction between 
a semantic coherence manipulation that modulated the degree 
to which targets were predictable and an acoustic speech signal 
distortion of those targets in frontal and temporal areas, provid- 
ing evidence for the involvement of the two regions in predictive 
coding. Sohoglu et al. (2012) examined the joint effects of the 
sensory distortion of a spoken word and the informativeness of 
preceding text resolving the distorted signal, suggesting that both 
factors modulate activity in temporal cortex albeit in opposing 
directions. That is, greater sensory detail evoked a greater response,
whereas closer alignment of the signal with top-down knowledge resulted
in a reduced response. Even more compelling evidence comes from an MEG 
study that demonstrated changes in activity in the superior tem- 
poral gyrus that were modulated based on differences between 
what was expected and what was heard. This study used a seg- 
ment prediction error task, in which the beginning segment of 
a word predicted or did not predict the end segment (formula 
vs. formubo) (Gagnepain et al., 2012). That temporal areas are 
involved in early perceptual processes and are also sensitive to this 
interaction led the authors to conclude that these areas reflect the 
encoding of prediction errors in speech perception (Clos et al., 
2012; Gagnepain et al., 2012; Sohoglu et al., 2012; Wild et al., 
2012). Together, the studies suggest that predictive coding, which 
generates feedback signals (presumably from frontal areas), modulates
temporal areas according to the predicted sensory input 
generated from the predictive context. 
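
The opposing influences of sensory detail and prior knowledge described above can be illustrated with a toy calculation. The following is a minimal sketch under stated assumptions, not a model proposed in any of the cited studies: the feature template, the `clarity` parameter (how much of the signal survives distortion), and the `context_strength` parameter (how strongly the context predicts the word) are all hypothetical, and the prediction error is taken to be a simple summed absolute difference.

```python
def prediction_error(template, clarity, context_strength):
    """Toy prediction error: mismatch between observed and predicted input.

    Hypothetical parameters, both in [0, 1]:
      clarity          -- fraction of the template that survives distortion
      context_strength -- how strongly the context predicts the template
    """
    observed = [clarity * t for t in template]
    predicted = [context_strength * t for t in template]
    # Summed absolute mismatch between what was heard and what was expected.
    return sum(abs(o - p) for o, p in zip(observed, predicted))


# Arbitrary feature pattern standing in for a spoken word (an assumption).
template = [1.0, 0.5, 0.0, 0.5]

weak_prior_low_detail = prediction_error(template, clarity=0.5, context_strength=0.0)
weak_prior_high_detail = prediction_error(template, clarity=1.0, context_strength=0.0)
strong_prior_high_detail = prediction_error(template, clarity=1.0, context_strength=1.0)
```

In this sketch, adding sensory detail without a matching prediction raises the error, whereas a prediction that matches the input lowers it. Real neural responses are likely to depend on additional factors, such as the precision of the prediction, that this sketch ignores.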

On the other hand, evidence from other studies suggests that 
the story may be more complex. For example, across studies, sim- 
ilar manipulations have produced different patterns of changes 
in BOLD signal (e.g., Davis et al., 2011 vs. Sohoglu et al., 2012). 
Since changes in BOLD signal may reflect different aspects of 
the error signal (e.g., degree, precision) (Friston and Kiebel, 
2009; Hesselmann et al., 2010), there are still many open ques- 
tions about the role of different regions in predictive coding. 
Furthermore, some interactions cannot be completely accounted 
for by a predictive coding framework (McGettigan et al., 2012; 
Guediche et al., 2013). In the predictive coding framework, acti- 
vation within areas reflecting prediction error signals should 
increase as the degree of discrepancy between the expected and 
actual input increases. However, some studies have shown interdependent
modulatory influences. For example, McGettigan et al. 
(2012) showed that responses to the quality or clarity of the acoustic
stimulus depended on the predictability of the context as well 
as other factors associated with the stimulus properties (e.g.,
intelligibility). That a predictive context may lead to either increased or 
decreased activity as a function of the intelligibility of the stimu- 
lus (in temporal and/or parietal areas) (McGettigan et al., 2012; 
Guediche et al., 2013) suggests that the generation of predic- 
tion error signals may be informed by the integration of multiple 
sources of information, and not solely by the computation derived 
from a predictive context. 

Above, we suggested that adaptive plasticity is guided by super- 
visory signals derived from discrepancies between expected and 
actual sensory input. The evidence we reviewed from recent stud- 
ies in speech perception examining predictive coding in online 
speech perception is beginning to reveal the cortical networks 
engaged by tasks that manipulate signal distortions and predic- 
tive contexts. To date, the findings related to interactions between 
predictive contexts and stimulus distortions provide support for 
a dynamic speech processing framework where predictions can 
be generated from contextual sources of information and be 
used to derive prediction error signals. In the predictive cod- 
ing framework the error signal presumably is used to optimize 
future predictions and drive learning mechanisms that lead to 
adaptive plasticity (Clark, 2013). Despite potential similarities 
between the mechanisms underlying these effects [although see 
Norris et al. (2003) for a different view], adaptive plasticity dif- 
fers from the online effects of predictive context on interpreting 
distorted speech acoustics in that it impacts subsequent per- 
ception of speech even once disambiguating contexts are no 
longer available. While it is possible that predictive coding pro- 
vides a means of generating prediction error signals that can be 
used to supervise adaptive plasticity, it is not clear how changes 
in activity related to predictive coding could give rise to the 
adaptive plasticity effects evident in the behavioral literatures 
reviewed above. Although many details about prediction-error- 
signal driven learning remain to be discovered, it is uncontrover- 
sial that the brain integrates incoming sensory information with 
prior perceptual, motor, and cognitive knowledge to arrive at a 
unified perceptual experience. 

NEUROIMAGING EVIDENCE FOR ADAPTIVE PLASTICITY IN SPEECH 
PERCEPTION 

In an attempt to dissociate neural changes directly related to adap- 
tive plasticity from modulatory effects of factors such as predictive 
context and stimulus distortions, we review studies that have 
specifically investigated changes in neural activity associated with 
adaptive plasticity (Adank and Devlin, 2010; Eisner et al., 2010; 
Kilian-Hutten et al., 2011a,b; Erb et al., 2013). Although tasks 
(word recognition and acoustic phonetic perception) and stim- 
ulus manipulations (noise-vocoded, time-compressed, ambigu- 
ous) vary across these studies, collectively they implicate the 
involvement of premotor, temporal, parietal, and frontal areas in 
adaptive speech perception. 

In word recognition studies, evidence for the recruitment of 
temporal and premotor areas is consistent across studies. Adank 
and Devlin (2010) examined adaptive plasticity during exposure 
to time-compressed speech and showed increased activation in 
bilateral auditory cortex and left ventral premotor cortex associ- 
ated with adaptation. They concluded that under adverse listening 
conditions, such as time compression, the dorsal motor stream 
is recruited to facilitate disambiguation of the speech signal. In 
a recent word recognition study, Erb et al. (2013) showed that 
greater changes in activity in precentral gyrus were associated 
with greater adaptive plasticity after exposure to a noise-vocoded 
speech distortion. The involvement of the motor system is con- 
sistent with prior work suggesting that motor recruitment may 
facilitate the resolution of perceptually ambiguous speech signals 
under difficult listening conditions (e.g., Davis and Johnsrude, 
2003, 2007; Rauschecker, 2011; Szenkovits et al., 2012). 

The recruitment of other regions may also be important for 
adaptive plasticity. Eisner et al. (2010) examined adaptation to a 
speech distortion that simulated cochlear-implant speech input 
and found that activity in superior temporal cortex and inferior 
frontal gyrus corresponded with improvements in intelligibility
with training. They also found that learning over the course 
of the experiment corresponded to modulation of activity in a 
parietal area, specifically the angular gyrus. The angular gyrus 
may be ideally suited for guiding the adaptation process, as its 
functional and structural connectivity with other brain regions 
suggests that it may provide a point of convergence for motor, 
sensory, and more abstract linguistic information (Binder et al., 
2009; Friederici, 2009; Turken and Dronkers, 2011). Guediche 
et al. (accepted) also showed differences in frontal and tempo- 
ral areas before vs. after adaptation to vocoded and spectrally- 
shifted speech. Taken together, changes in frontal, temporal, and 
premotor areas have been associated with manipulations of disambiguating
contexts and the severity/intelligibility of the 
distorted stimuli. 

Fewer studies have investigated visually- and lexically- 
mediated adaptive plasticity of acoustic phonetic perception 
using neuroimaging. One study examined visually-mediated 
adaptive plasticity using videos of articulating faces to disam- 
biguate ambiguous acoustic speech stimuli (Kilian-Hutten et al., 
2011a). As in the behavioral study by Bertelson et al. (2003), exposure
to an ambiguous token paired with a video of a face clearly 
articulating one of the phonetic alternatives led the ambiguous 
token to be perceived more often as the alternative consistent 
with the articulating face in a later acoustic phonetic percep- 
tion task. Kilian-Hutten et al. (2011a) showed that the perceptual 
interpretation of the ambiguous sounds could be decoded with 
multi-voxel pattern analysis in temporal areas (adjacent to and 
encompassing Heschl's gyrus). This demonstrates a change in the 
neural pattern of activity consistent with the perceptual change 
relatively early in auditory cortical networks caused by adap- 
tive plasticity. In order to identify regions involved in learning, 
Kilian-Hutten et al. (2011b) examined how brain activity during 
adaptation was related to later perception of the ambiguous stim- 
uli. They found that the visually-mediated adaptive plasticity of 
acoustic phonetic perception corresponded to changes in activ- 
ity in a network of areas including frontal, temporal, and parietal 
areas (Kilian-Hutten et al., 2011b). 

To our knowledge, only one neuroimaging study has exam- 
ined lexically-mediated adaptive plasticity in acoustic phonetic 
perception (Mesite and Myers, 2012). Similar to the behavioral 
study by Kraljic and Samuel (2005), two groups of participants 
were exposed to ambiguous [s]-[ʃ] tokens in different biasing 
lexical contexts. They showed between-group changes in sub- 
sequent acoustic phonetic perception of the ambiguous tokens 
presented without lexically-disambiguating contexts. The behav- 
ioral changes in acoustic phonetic perception were associated 
with differences in the activity of right frontal and middle tem- 
poral areas. The limited data that exist thus suggest that, similar 
to the findings from word recognition studies, adaptive plasticity 
evidenced in acoustic phonetic perception of ambiguous phonetic 
categories engages a network of frontal, temporal, and parietal 
areas. 

Because of the use of different stimuli, tasks (examining con- 
text effects vs. adaptation effects), and analyses (focusing on 
specific changes and sometimes specific regions) across studies, 
many questions remain open. Furthermore, even though there 
is a great deal of evidence supporting the multiple stream view 
of speech processing, there is still debate regarding the role of 
specific regions in speech and language processes. Despite these 
caveats, the current evidence is consistent with a view that frontal 
(e.g., inferior frontal and middle frontal gyrus) and temporal 
areas (e.g., superior temporal and middle temporal gyrus) are 
sensitive to context and stimulus properties. Frontal areas may 
provide the source of the predictive feedback, potentially involv- 
ing different frontal areas for different sources of contextual 
information (Rothermich and Kotz, 2013) and may modulate 
activity in temporal areas associated with earlier perceptual processes
(Gagnepain et al., 2012). Adaptive plasticity, in contrast, may rely
more specifically on the recruitment of higher association areas
(e.g., parietal cortex) (Obleser et al., 2007; 
Eisner et al., 2010; Guediche et al., accepted). In all, the literatures
investigating the neural basis of predictive coding and 
adaptive plasticity complement one another and can be lever- 
aged for developing and refining a more detailed model of the 
dynamic, flexible nature of speech perception. 

Despite these advances in our understanding of how specific 
cortical regions may contribute to a dynamically adaptive speech 
perception network, presently, there is no formal speech percep- 
tion model that relates activity in the cortical regions identified 
via neuroimaging to the computational demands of adaptive plas- 
ticity in speech perception. Conversely, the classic computational 
models of speech perception that have attempted to differentiate 
how the system may meet the computational demands of adap- 
tive plasticity have not made specific predictions of the underlying 
neural mechanisms. Next-generation models will need to bridge 
this divide to explain how adaptive changes in perception are 
reflected in brain activity and how they take place without under- 
mining the stability of and sensitivity to long-term regularities. 

We next examine literatures outside of speech perception 
for insight into how we may make progress toward meeting 
these challenges. Inasmuch as it relates to the dual demands of 
maintaining long-term representations that respect regularities 
of the environment while flexibly adjusting perception to short- 
term deviations from these regularities, adaptive plasticity is not 
unique to speech perception. Preserving the balance between sta- 
bility and plasticity is important for perceptual, motor and cogni- 
tive processing in many domains. Consequently, research outside 
the domain of speech perception may provide insight regarding 
the development of a biologically plausible account of adap- 
tive plasticity in speech processing that captures the significant 
behavioral characteristics we outlined above. 

INSIGHTS FROM NEUROSCIENCE 

Thus far, research on the neural basis of adaptive plasticity in 
speech perception has been largely focused on cerebral cor- 
tical regions. In the section that follows, we argue that the 
cerebellum plays a role in adaptive plasticity in speech percep- 
tion. Specifically, we review evidence from sensorimotor learn- 
ing for cerebellar involvement in perception, predictive coding, 
and adaptive plasticity. We consider the potential importance 
of cerebro-cerebellar interactions in generating prediction errors 
derived from discrepancies between predicted and actual sen- 
sory input. Such a mechanism may provide a way to unite 
the seemingly distinct behavioral speech perception phenomena 
we reviewed above. Finally, we propose that such a mechanism 
may be especially relevant since it offers a means to achieve 
rapid adjustment of perception in response to short-term devi- 
ations without undermining the stability of learned long-term 
regularities. 

It may seem surprising to consider the cerebellum as part of a 
network involved in perceptual plasticity as, historically, the cere- 
bellum has been considered a primarily motor structure. Since 
many neuroimaging studies of speech perception are focused on 
changes in perisylvian areas, data collection and/or analyses often 
fail to consider the cerebellum. However, outside the domain of 
speech perception, there has been increased interest in the cere- 
bellum's role in non-motoric functions, with some limited but 
compelling evidence that it is involved in cognitive functions, 
including language (Fiez et al., 1992; Desmond and Fiez, 1998; 
Thach, 1998; Strick et al., 2009; although see Glickstein, 2006 
for debate). This perspective posits that the cerebellar system 
plays an important role in supervised learning across many dif- 
ferent domains through the manipulation of internal models (Ito, 
2008). We next briefly review evidence for cerebellar involvement 
in sensorimotor adaptation. 

CEREBELLAR-DEPENDENT SUPERVISED LEARNING IN 
SENSORIMOTOR TASKS 

In the sensorimotor domain, the underlying mechanisms of adap- 
tation to sensory input distortions have been explored extensively, 
with multiple lines of evidence underscoring the significance of 
the cerebellum. A classic behavioral task demonstrating sensori- 
motor adaptation is visually-guided reaching while wearing prism 
goggles (e.g., Martin et al., 1996b). When prism goggles that 
shift the visual field several degrees distort sensory input, motor 
behavior in a visually guided reaching task is impacted. Initially, 
reaches are off-target. However, participants rapidly adapt to the 
distorted sensory input across 10-20 reaches, as evidenced by 
successful on-target reaching (Martin et al., 1996b). Such sensorimotor
adaptation is observed across many stimulus distortions 
and motor behaviors (Kawato and Wolpert, 1998; Wolpert et al., 
1998). Clinical studies examining performance on sensorimotor 
tasks in patients with cerebellar damage (Martin et al., 1996a; 
Ackermann et al., 1997), functional neuroimaging studies exam- 
ining changes in neural activity in short-term adaptation tasks 
(Clower et al., 1996), and lesion studies with non-human primates
(Kagerer et al., 1997; Baizer et al., 1999) all implicate the 
cerebellum as having an important role in such sensorimotor 
adaptation. 

The role of the cerebellum in sensorimotor adaptation has 
been attributed largely to supervised learning mechanisms based 
on internally-generated sensory prediction errors (e.g., Doya, 
2000; Shadmehr et al., 2010). Cerebellar-dependent supervised 
learning within the context of sensorimotor adaptation is thought 
to rely on the internal generation of sensory prediction error sig- 
nals derived from discrepancy between the predicted and actual 
sensory input (Wolpert et al., 2011). The predicted sensory input 
is the expected outcome of a planned movement (a reach, for 
example) and can thus be derived from the "internal model" 
of the input-output relationship of sensory and motor informa- 
tion. With repeated visually-guided reaches while wearing prism 
goggles, for example, the sensory prediction errors reconfigure 
the relationship among visual, motor, and proprioceptive infor- 
mation sources to optimize future predictions and minimize error 
signals, leading to adaptation evidenced by more accurate reach- 
ing on subsequent trials (Kawato and Wolpert, 1998; Bedford, 
1999; Desmurget and Grafton, 2000; Flanagan et al., 2003; Scott 
and Wise, 2004; Shadmehr et al., 2010; Clark, 2013). 
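
The error-driven account of prism adaptation described above can be sketched as a simple delta-rule update. This is a minimal illustration under stated assumptions rather than an implementation of any of the cited models: the internal model is reduced to a single estimate of the visual shift, and the shift size, learning rate, and number of reaches are hypothetical values chosen only so that the error shrinks over roughly the 10-20 reaches mentioned earlier.

```python
def simulate_prism_adaptation(prism_shift_deg=10.0, learning_rate=0.3, n_reaches=20):
    """Return the reach error (in degrees) on each trial of a toy adaptation task.

    All parameter values are illustrative assumptions.
    """
    estimated_shift = 0.0  # internal model's current estimate of the visual shift
    errors = []
    for _ in range(n_reaches):
        # The reach compensates only for the part of the prism shift that the
        # internal model already expects; the remainder is the prediction error.
        reach_error = prism_shift_deg - estimated_shift
        errors.append(reach_error)
        # Supervised (delta-rule) update: move the estimate toward the
        # observed outcome in proportion to the prediction error.
        estimated_shift += learning_rate * reach_error
    return errors


errors = simulate_prism_adaptation()
```

With these values the first reach misses by the full prism shift, and the error then decays geometrically across reaches. A larger `learning_rate` would produce faster adaptation, at the cost of a larger aftereffect when the goggles are removed.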

Such sensorimotor adaptation is also evident in the domain 
of speech. Adaptation is observed when speakers experience sen- 
sory input distortions while talking, such as through real-time 
manipulation of voice acoustics to alter acoustic feedback from 
one's own voice or via somatosensory perturbations that alter 
the feel of speech articulation (e.g., Houde and Jordan, 1998, 
2002; Perkell et al., 2007; Villacorta et al., 2007; Shiller et al., 
2009; Golfinopoulos et al., 2011; Chang et al., 2013). Speakers 
quickly adjust their production in a direction that compensates 
for the sensory input distortion (Houde and Jordan, 1998). In this 
way, speech production exhibits compensatory motor changes in 
response to distorted sensory input just as observed for other sen- 
sorimotor tasks (Houde and Jordan, 1998, 2002; Jones, 2003). 
A range of acoustic manipulations has been examined including 
shifts in fundamental frequency, vowel formant frequency, and 
the timing of auditory speech feedback (Houde and Jordan, 1998; 
Jones and Munhall, 2000; Perkell et al., 2007). These shifts can be 
quite extreme. In one study, participants produced a completely 
different vowel sound relative to the intended target after they 
were exposed to vowel formant shifts (Houde and Jordan, 1998). 

Neuroanatomical models of speech production have incorpo- 
rated the idea of internal models that represent the relationship 
between the sensory input and motor output (Guenther, 1995; 
Guenther and Ghosh, 2003; Kotz and Schwartze, 2010; Tian and 
Poeppel, 2010; Price et al., 2011). Guenther (1995) and Guenther and 
Ghosh (2003) developed a neuroanatomically-based computational
model of speech production that incorporates expected 
relationships between a desired sensory outcome, the motor com- 
mands that should produce this outcome, and the actual sensory 
consequences of the produced speech. The DIVA (Directions Into 
Velocities of Articulators) model consists of several cerebral corti- 
cal areas that interact with the cerebellum, forming a network that 
guides sensorimotor adaptation in speech production. Through 
these interactions, internal models can be used to detect and 
correct errors under sensory input perturbations. Neuroimaging 
studies of sensorimotor adaptation in speech production have 
yielded results consistent with predictions from this model. In 
a study that investigated somatosensory perturbations by using 
a device to block jaw movement, increases in the BOLD signal 
were observed across left inferior frontal gyrus, ventral premo- 
tor cortices, supramarginal gyri, and the cerebellum, consistent 
with the model's predictions. These results provided support for 
the view that cerebro-cerebellar interactions are involved in sensorimotor
adaptation in speech (Golfinopoulos et al., 2011). A 
recent study by Zheng et al. (2013) suggests that multiple interact- 
ing functional networks are involved in coding different aspects of 
the error signals. As reviewed briefly above, although the speech 
production literature has focused largely on cerebral cortical areas 
(e.g., Price et al., 2011; but see Guenther and Ghosh, 2003), 
there is convergent evidence from other literatures that supervised 
prediction error learning involves cerebro-cerebellar interactions 
(Doya, 2000; Ito, 2008; Wolpert et al., 2011). In the current speech 
production models, generation of prediction error signals may 
relate to those in speech perception either through the sensory 
expectations that are generated from internal speech processes 
(e.g., Tian and Poeppel, 2010) or from phonological information 
(e.g., Price et al., 2011). 

There is still debate regarding the role of the motor system 
in generating predictions during speech perception. Pickering 
and Garrod (2007) suggested that multiple levels of linguistic 
information (e.g., semantic, syntactic) engage speech produc- 
tion processes to generate predictions. More recently, Tian and 
Poeppel (2013) instructed participants to engage in overt speak- 
ing, covert/imagined speaking, or imagined hearing and found 
that there may be differences in how predictions are generated 
depending on the nature of the speaking tasks participants were 
engaged in. Tian and Poeppel (2013) suggest that linguistic infor- 
mation retrieved from memory, as well as inner speech processes, 
can be used to generate predictions and modulate activity in 
regions associated with perceptual processes. This is consistent 
with models of visual perception, which also suggest that mul- 
tiple sources of information can provide feedback to early visual 
areas (Mumford, 1992; Rao and Ballard, 1999). Thus, the cerebellar-dependent
supervised learning mechanisms that may contribute to 
adaptive plasticity in speech perception may operate on 
prediction error signals derived directly from different sources 
of linguistic information, indirectly from inner speech motor 
processes, or both. 

Although the focus of research has been, and continues to be, 
on cerebellar contributions to the adaptive control of movement 
through sensorimotor adaptation, there is mounting evidence 
that the cerebellum is also involved in many other perceptual 
(Ivry, 1996; Petacchi et al., 2005) and cognitive behaviors (Fiez 
et al., 1992; Desmond and Fiez, 1998; Thach, 1998; Strick et al., 
2009). At the outset, we noted that the cerebellum is increas- 
ingly recognized to play an important role in supervised learning, 
across many domains, through the manipulation of internal 
models (Ito, 2008). In sensorimotor learning, sensory prediction 
errors realign internal models of sensorimotor relationships. If 
the role of the cerebellum is more general, it is possible that it is 
involved in supervised learning that serves to align sensory input 
with predictions arising from nonmotor sources, thus extending 
cerebellar-dependent supervised learning outside sensorimotor 
domains (e.g., Doya, 2000; Ito, 2008; Strick et al., 2009). 

Indeed, in a nonmotor perceptual task, recent evidence points 
to cerebellar involvement in perception of spatiotemporal rela- 
tionships. Roth et al. (2013) recently demonstrated that cerebellar 
patients are impaired in their ability to adapt to discrepancies in 
a nonmotor task that relies on spatio-temporal judgments about 
a visual target. This study provides direct evidence of cerebellar 
involvement in perceptual adaptation within an entirely non- 
motor task that is not dependent on the consequences of one's 
own motor behavior. There is also evidence that the cerebellum 
is involved in encoding acoustic sensory prediction error signals 
in a nonmotor task. Schlerf et al. (2012) showed that activity in 
the cerebellum is modulated by sensory changes in an acoustically
presented stimulus, and by different 
forms of predictive information (Rothermich and Kotz, 2013). 
In sum, intriguing recent results, even outside the domain of 
speech perception, suggest the possibility of cerebellar involve- 
ment in supervised learning that extends beyond sensorimotor 
interactions. 

In light of known interactions between perception and pro- 
duction, a relationship between the mechanisms that underlie 
sensorimotor and sensory adaptation seems likely. In fact, even 
sensorimotor adaptation can evoke "purely" perceptual shifts that 
are unaccounted for by changes in motor output (e.g., Shiller 
et al., 2009; Nasir and Ostry, 2009; Mattar et al., 2011). For 
example, Shiller et al. (2009) demonstrated that after sensorimo- 
tor adaptation of speech production induced by altered auditory 
feedback of a listener's own /ʃ/ (as in ship) productions, subsequent
perception of another talker's /s/-/ʃ/ (as in sip to ship) 
sounds was also shifted. Thus, the consequences of sensorimotor 
adaptation (attributed to cerebellar supervised learning mecha- 
nisms) may have a perceptual component that is unrelated to 
changes in motor output. 

The link between sensorimotor adaptation and sensory adap- 
tation, together with recent evidence implicating the cerebellum 
in purely perceptual adaptation (e.g., Roth et al., 2013), suggests 
that the supervised learning mechanisms posited for sensorimotor
adaptation in speech (Houde and Jordan, 1998; Jones and 
Munhall, 2000; Guenther and Ghosh, 2003; Shiller et al., 2009) 
can also provide a framework for understanding adaptive plastic- 
ity in speech perception. In speech perception, predictions about 
sensory input may be derived from multiple sources of informa- 
tion (e.g., lexical, visual) that constrain listeners' interpretation of 
incoming acoustic signals. 

Guediche et al. (accepted) recently examined the potential 
for cerebellar contributions to adaptive plasticity in speech per- 
ception. To this end, they examined neural activity linked to 
improvements in recognition of acoustically distorted words. 
Several cerebellar regions showed significantly different activa- 
tion before, compared to after, adaptation to acoustically dis- 
torted words. Activity in one region, right Crus I (previously 
implicated in language tasks; Stoodley and Schmahmann, 2009; 
Keren-Happuch et al., 2012), was significantly correlated with
behavioral improvement measures of adaptive plasticity during 
the adaptation phase of the experiment. A seed functional cor- 
relation analysis revealed that hemodynamic responses in right 
Crus I during adaptation significantly covaried with areas in pari- 
etal and temporal cortices. This evidence is consistent with prior 
functional neuroimaging findings implicating these cerebral cor- 
tical regions in adaptive plasticity (e.g., Eisner et al., 2010), and 
extends those prior findings to include the cerebellum as part of 
a cerebro-cortical functional network that contributes to adaptive 
changes in speech perception. 

In sum, the recent theoretical development and empirical 
investigation of predictive coding and adaptive plasticity in 
speech processing, as reviewed above, offer a framework for
understanding how prediction errors may be computed, repre- 
sented, and used to optimize perception. Although prior neu- 
roimaging studies of speech perception adaptation and predictive 
coding have specifically focused on changes in cerebral cortical
areas, the converging lines of evidence described above are
consistent with the involvement of cerebellar-supervised learning
via cerebro-cerebellar interactions. We are proposing that the
cerebellum plays a key role in adaptive plasticity and critically 
provides a mechanism that can allow for plasticity in the context 
of a stable perceptual system. In particular, the cerebellum pro- 
vides an established neural mechanism known to be involved in 
rapid adaptive plasticity. More research will be needed to exam- 
ine this issue but this hypothesis provides a working framework 
for examining the dual roles of stability and plasticity in cognitive 
systems generally, and in speech perception in particular. 

Finally, with regard to maintaining stability, there is evidence
suggesting that the cerebellum (potentially through interactive
loops with cerebral cortex) may maintain
multiple adaptive adjustments to internal models (Cunningham
and Welch, 1994; Martin et al., 1996b; Imamizu et al., 2003). This
provides the means for rapid and short-term adaptive plastic- 
ity that can be implemented without catastrophically affecting 
the stability of long-term regularities. Most germane to adap- 
tive plasticity in speech perception, it presents the opportunity 
for multiple relationships between acoustic input and linguistic 
information to be simultaneously represented, such as might be 
necessary to maintain adaptation to different speakers or different 
accents. Thus, future neuroimaging efforts should be attentive to 
including the cerebellum (and potentially other subcortical struc- 
tures) in the network of regions investigated as contributing to 
adaptive plasticity in speech perception. 
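
The idea that several short-term adjustments can coexist with a stable
long-term mapping can be made concrete with a toy sketch. The example
below is an illustration only and is not a model proposed in the work
cited above: the class name, the single acoustic cue dimension, the cue
values, and the learning rate are all assumptions chosen for clarity.
It keeps one fixed long-term expectation per category and learns a
separate error-driven offset for each talker.

```python
# Toy illustration (hypothetical names and values throughout): a fixed
# long-term mapping plus talker-specific offsets learned from prediction error.

class ContextualListener:
    def __init__(self, lr=0.3):
        # Long-term expected cue value for each category (arbitrary units).
        # These values are never modified by adaptation.
        self.long_term = {"s": 0.2, "sh": 0.8}
        self.lr = lr
        # One short-term adjustment per talker, created on demand.
        self.offsets = {}

    def expected(self, talker, category):
        """Predicted cue value for this category when this talker speaks."""
        return self.long_term[category] + self.offsets.get(talker, 0.0)

    def adapt(self, talker, cue, category):
        """Update only this talker's offset, given a disambiguated token.

        `category` plays the role of the teaching signal (for example,
        a category resolved by lexical context).
        """
        error = cue - self.expected(talker, category)  # prediction error
        self.offsets[talker] = self.offsets.get(talker, 0.0) + self.lr * error
        return error

    def classify(self, talker, cue):
        """Pick the category whose talker-adjusted expectation is nearest."""
        return min(self.long_term,
                   key=lambda c: abs(cue - self.expected(talker, c)))


if __name__ == "__main__":
    listener = ContextualListener()
    # Talker A produces "s" with a shifted cue value; lexical context
    # identifies the intended category on each exposure.
    for _ in range(20):
        listener.adapt("talker_A", cue=0.45, category="s")
    print(listener.classify("talker_A", 0.55))
    print(listener.classify("talker_B", 0.55))
    print(listener.long_term)
```

In this sketch the same token is categorized differently for an adapted
and an unadapted talker while the long-term values remain untouched,
which is the stability-plasticity property at issue; how such
context-specific adjustments are actually implemented neurally remains
an open question.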

CONCLUSIONS AND FUTURE DIRECTIONS 

Everyday speech communication largely takes place in suboptimal 
or even adverse listening conditions, at least relative to the pris- 
tine listening environments in which most research is conducted. 
The acoustic speech signals most often conveying meaning to 
listeners in everyday conversation carry the influence of noisy 
environments, foreign-accented talkers, reduced conversational
speech, and dysfluency (see Mattys et al., 2012). We have reviewed 
several parallel behavioral literatures that demonstrate that the 
perceptual system makes rapid adaptive adjustments in response 
to distorted acoustic speech input. We make the case that these 
largely unconnected behavioral literatures, which have focused 
on different aspects of speech processing (spoken word recognition
and acoustic-phonetic perception), may, in fact, be linked
by common factors. We have reviewed computational modeling 
in the speech perception and neuroscience literatures within and 
outside the field of speech communication. We have considered 
how these literatures speak to prospective mechanisms and their 
ability to unite the behavioral literatures on adaptive plasticity in 
word recognition and acoustic-phonetic perception. In addition,
we considered two separate, but complementary, neuroimaging 
literatures on predictive coding and adaptive plasticity, with the 
goal of informing the mechanistic basis of adaptive plasticity in 
speech perception. Both predictive coding and adaptive plastic- 
ity models posit mechanisms for encoding error signals when 
there is a discrepancy between predicted and actual sensory input. 
Supervised learning mechanisms that rely on prediction error sig- 
nals for rapid adaptive plasticity have been well-established in the 
sensorimotor literature, including speech production adaptation 
tasks, and have been attributed to cerebro-cerebellar interactions.
More recently, they have been implicated in nonmotor, perceptual
tasks including speech perception. We posit that these findings
suggest that prediction error-driven learning orchestrated via
cerebro-cerebellar interactions may play a role in adaptive plasticity in
speech perception. 

Based on the synthesis of these literatures, we argued that the 
generation of predictions, prediction error signals, and supervised 
learning may be significant in driving adaptive plasticity. In par- 
ticular, we highlighted the potential for a cerebellar-dependent 
supervised learning mechanism to play a role in adaptive plastic- 
ity in speech perception and described preliminary evidence that 
supports this possibility. This perspective suggests some direc- 
tions for future research that will better develop neurobiological 
models of speech communication that capture the dynamic, 
online flexibility of the system. 

Although a great deal of evidence points to the importance 
of subcortical-cortical interactions in adaptive plasticity in other 
domains, the mainstream literature on speech perception has yet 
to make significant contact with the literature on subcortical con- 
tributions to adaptive plasticity. Neuroscience research relevant 
to adaptive plasticity in speech perception and, indeed, to speech
perception more generally, has tended to be focused on the
cerebrum. Although we know less about contributions of subcor- 
tical structures in speech perception, there have been a number 
of studies that have highlighted roles for the cerebellum, thala- 
mus, caudate, and the brainstem that may be defined by specific 
functions, or interactions with specific regions in cerebral cortex 
(Ravizza, 2003; Tricomi et al., 2006; Song et al., 2008, 2011, 2012;
Stoodley and Schmahmann, 2009; Anderson and Kraus, 2010;
Stoodley et al., 2012; Erb et al., 2013).

In the broader neuroscience literature, developing perspectives 
have suggested that different types of learning mechanisms may 
be subserved by different neural systems. At least three types of 
potentially distinct and interacting learning circuits have been 
proposed for unsupervised, reinforcement, and supervised learning
(see Doya, 2000; Hoshi et al., 2005; Bostan et al., 2010;
Wolpert et al., 2011). Doya (2000) suggested that unsupervised
learning algorithms depend mostly on long-term changes in cere- 
bral cortex that can be incorporated over longer timecourses 
(Doya, 2000). Reinforcement learning, on the other hand, relies 
on information to predict reward outcomes. In speech percep- 
tion, reinforcement learning has been examined in the context of 
non-native category learning. In a functional neuroimaging study, 
Tricomi et al. (2006) examined learning with performance feed- 
back and found that basal ganglia activity was modulated by the 
presence of feedback during a non-native phonetic category perception
task just as it is in other reinforcement learning tasks
(e.g., Delgado et al., 2000). Whereas reinforcement learning may 
optimize subsequent reward prediction error and engage the basal 
ganglia, supervised learning may optimize sensory prediction 
error signals by engaging the cerebellum. 
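
The computational contrast between these two error signals can be
sketched with two toy update rules. This is a minimal illustration
under our own assumptions, not an implementation taken from any of the
studies cited here: the function names, the linear predictor, and the
learning rates are hypothetical. The first rule is a standard delta
rule driven by a signed sensory prediction error; the second is a
Rescorla-Wagner-style rule driven by a scalar reward prediction error.

```python
# Toy update rules (illustrative only; names and parameters are assumptions).

def supervised_update(weights, inputs, target, lr=0.1):
    """Delta rule: change weights in proportion to a sensory prediction error."""
    prediction = sum(w * x for w, x in zip(weights, inputs))
    error = target - prediction  # actual minus predicted sensory outcome
    new_weights = [w + lr * error * x for w, x in zip(weights, inputs)]
    return new_weights, error


def reinforcement_update(value, reward, lr=0.1):
    """Rescorla-Wagner-style rule: change an expected value using a
    reward prediction error."""
    rpe = reward - value  # obtained minus expected reward
    return value + lr * rpe, rpe


if __name__ == "__main__":
    weights = [0.0, 0.0]
    for _ in range(100):
        weights, err = supervised_update(weights, [1.0, 0.5], target=2.0)
    print("final sensory prediction error:", err)

    value = 0.0
    for _ in range(100):
        value, rpe = reinforcement_update(value, reward=1.0)
    print("final expected reward:", value)
```

The two rules share the same error-correcting form. What differs is the
content of the teaching signal: a specification of what the sensory
outcome should have been in the supervised case, versus an evaluation
of how good the outcome was in the reinforcement case.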

In speech perception, both unsupervised and supervised learn- 
ing mechanisms have been used to account for adaptive plasticity 
(Norris et al., 2003; Mirman et al., 2006). Outside the domain of
speech perception, unsupervised learning mechanisms are gen- 
erally used to model learning that arises over longer time courses 
(McClelland et al., 1995; O'Reilly, 2001; Grossberg, 2013) than the
learning that characterizes adaptive plasticity. Supervised learning
in models of speech perception has not accounted for many
known behavioral and biological constraints. However, outside
the domain of speech perception, recent models have explored 
a number of alternatives for achieving neurobiologically plausible 
supervised learning algorithms (e.g., Yu et al., 2008; Chinta and 
Tweed, 2012). 

In speech, there is behavioral evidence that listeners can
achieve levels of adaptation beyond those reached
with rapid adaptation training paradigms if they are exposed to
multiple sessions with consolidation (Banai and Lavner, 2012). 
Improvements in word recognition for distorted acoustic input 
degrade over the course of a day-long retention interval, but are 
fully restored with sleep; sleep thus appears to stabilize what is 
learned in adaptation to distorted speech (Fenn et al., 2003), 
with word recognition improvements lasting as long as 6 months 
(Schwab et al., 1985). Thus, a fully mechanistic account of speech 
processing will require an understanding of how and to what 
extent different learning mechanisms interact with one another 
to influence speech processing. Some computational accounts of 
perception have begun to incorporate different types of learn- 
ing algorithms within single systems (Hinton and Plaut, 1987; 
O'Reilly, 2001; Kleinschmidt and Jaeger, 2011; Grossberg, 2013). 
One challenge for models of speech processing is to account for 
the equilibrium that must be maintained between mechanisms 
involved in preserving stability while supporting plasticity. 

In light of the parallels we have drawn between adaptive 
plasticity in speech perception and sensorimotor adaptation, 
it is interesting to note that research has demonstrated reten- 
tion of sensorimotor adaptation effects over more than a year 
(Yamamoto et al., 2006), suggesting that cerebellar-dependent
supervised learning can evoke changes in internal models that 
are maintained across long time periods. Yamamoto et al. spec- 
ulate that the extent to which sensorimotor adaptation is retained 
depends on an interaction between the number of training trials 
and the magnitude of the distortion, with more subtle distortions 
leading to longer-lasting adaptation perhaps because they evoke 
smaller errors and avoid engaging explicit compensation mecha- 
nisms (Redding and Wallace, 1996). These issues have not been 
investigated in the adaptive plasticity of speech perception, but 
have important implications for long-lasting adaptation in speech 
perception. Understanding the details of the interplay between the 
different types of learning mechanisms will be crucial for under- 
standing how the system maintains balance between stability and 
plasticity in speech perception. 

Beyond delineating the learning mechanisms available to guide 
adaptive plasticity in speech perception, there also are many 
open questions regarding the nature of putative prediction errors 
and how predictions may be derived from various information 
sources. The field has focused much attention on the role of lexical 
information in driving adaptive plasticity. Other sources of infor- 
mation, such as co-speech gestures from arm and hand move- 
ments associated with speech communication (Skipper et al., 
2009), semantic or sentence context (e.g., Borsky et al., 1998; 
Zekveld et al., 2012), knowledge about the speaker (Samuel and 
Kraljic, 2009), or previously learned voice-face associations (von 
Kriegstein et al., 2008) may provide a basis for disambiguating
distorted acoustic input via prediction errors and, potentially,
may drive adaptive plasticity. Indeed, in more natural communi- 
cation, many different information sources converge to constrain 
predictions and disambiguate acoustic speech input. The emerg- 
ing framework we have begun to sketch unites the means by which 
these very different information sources drive adaptive plasticity 
in speech perception. These other sources of information pro- 
vide a constraint on the predictions the system makes about the 
intended message and, in turn, affect the sensory prediction that 
is made and the prediction error that results. Moreover, since both 
internally-generated and external sensory input inform predic- 
tions, it becomes easier to reconcile seemingly distinct influences 
of acoustic sensory distortions and higher-level influences such 
as expectations about speaker- or context-specific factors that 
influence speech (Kraljic et al., 2008; Kraljic and Samuel, 2011).

In conclusion, evidence for a flexible speech perception sys- 
tem that rapidly adapts to accommodate systematic distortions in 
acoustic speech input is abundant. A review of behavioral, com- 
putational, and neuroscience research related to rapid adaptive 
mechanisms suggests that it may be informative to consider phe- 
nomena in literatures outside of speech communication to iden- 
tify common and unifying principles of how the brain balances 
stability and plasticity. Here, we examined cerebellar-dependent 
supervised learning that relies on sensory prediction error sig- 
nals as a potential mechanism for supervising adaptive changes in 
speech perception. The predictions used to derive the error signals 
may be generated from multiple interacting sources of external 
sensory and internally-generated information. By incorporating 
cerebral-subcortical interactions established in other literatures 
into neuroanatomical theories of speech perception, the mech- 
anisms that contribute to stability and plasticity may be better 
understood.